Outline


Context

Topology is the study of space and spaces. Pure topology asks questions about the generic properties of space, invariant under continuous deformation… there is no distance, no angles, no calculus.

Fig.1: A donut and a coffee cup are the same to a topologist.

Fig.1: A donut and a coffee cup are “the same” to a topologist.

Fig.2: Proof from a paper I wrote.

Fig.2: Proof from a paper I wrote.


Topological data analysis (TDA) is topology applied to understanding “the shape of data”. For example, how many connected components does the dataset have? How many holes, of what dimensions?

My favorite introductory TDA paper is Topology and Data, by Gunnar Carlsson, a Stanford math professor who is a co-founder of Ayasdi, the only TDA-centric data analysis company (that I know of).

A non-technical paper with several examples of real-world applications of TDA is Extracting Insights from the Shape of Complex Data Using Topology, by P. Y. Lum et al.


Examples of applications; what is TDA good for?

Zero-dimensional persistent homology can tell you what k to use for k-means clustering.

One-dimensional persistent homology can detect cycles in a network, or a dynamic predatory-prey or position-velocity equilibrium.

Higher-dimensional persistent homology detects more complex global structure in the data.

TDA is holistic/global, so you don’t need to know what you’re looking for.

It is (or can be) robust to noise and deformations, so is somewhat scale-independent.


Fig. 3: Muthu Alagappan, Stanford. 2010-2011. 452 NBA Players, 7 normalized statistics

Fig. 3: Muthu Alagappan, Stanford. 2010-2011. 452 NBA Players, 7 normalized statistics



Fig.5: M Nicolau et al. PNAS, 2011.

Fig.5: M Nicolau et al. PNAS, 2011.




Implementations

There is an R package called TDA – very handy but not the fastest probably.

The fastest implementations of TDA algorithms are in C++. There’s a short introduction to TDA and thorough comparison of algorithms and software in this paper.


Explanation and demonstration in 2D

Fig.8: The persistent homology pipeline.

Fig.8: The persistent homology pipeline.

I wrote a nifty Processing sketch to explain what a persistent homology barcode is…


Three examples with real datasets

Here are some examples from my own work.


A. Clustering of Chicago’s 77 Community Areas

[Based on this report.]

The City of Chicago divides itself into 77 “Community Areas”. The city’s data website has a nice dataset (downloaded from here, codebook here) that contains information on 27 different public health and economic factors for each of the 77 areas. It has information on 21 health factors: fertility rates, cancer rates, lead poisoning, STD rates, etc., as well six economic factors: “Below.Poverty.Level”, “Crowded.Housing”, “Dependency” “No.High.School.Diploma”, “Per.Capita.Income”, and “Unemployment”.

Ignoring labels, we have 77 points in 27-dimensional space. Here are the column names. We want to cluster according to 3-23, and see where these clusters live in economic space (24-29).

##  [1] "Community.Area"                            
##  [2] "Community.Area.Name"                       
##  [3] "Birth.Rate"                                
##  [4] "General.Fertility.Rate"                    
##  [5] "Low.Birth.Weight"                          
##  [6] "Prenatal.Care.Beginning.in.First.Trimester"
##  [7] "Preterm.Births"                            
##  [8] "Teen.Birth.Rate"                           
##  [9] "Assault..Homicide."                        
## [10] "Breast.cancer.in.females"                  
## [11] "Cancer..All.Sites."                        
## [12] "Colorectal.Cancer"                         
## [13] "Diabetes.related"                          
## [14] "Firearm.related"                           
## [15] "Infant.Mortality.Rate"                     
## [16] "Lung.Cancer"                               
## [17] "Prostate.Cancer.in.Males"                  
## [18] "Stroke..Cerebrovascular.Disease."          
## [19] "Childhood.Blood.Lead.Level.Screening"      
## [20] "Childhood.Lead.Poisoning"                  
## [21] "Gonorrhea.in.Females"                      
## [22] "Gonorrhea.in.Males"                        
## [23] "Tuberculosis"                              
## [24] "Below.Poverty.Level"                       
## [25] "Crowded.Housing"                           
## [26] "Dependency"                                
## [27] "No.High.School.Diploma"                    
## [28] "Per.Capita.Income"                         
## [29] "Unemployment"

To determine the number of clusters, look at the persistence barcode.

So try using k = 3 or k = 5, but not k = 4.

First we cluster with k = 3 (using only the health data columns). Two important economic factors are Per.Capita.Income and Unemployment. The next plot shows how the three clusters occupy this space. It does seem that they maintain their clusters.

And the same plot using k=5 clusters:


B. Cycles and bubbles from delayed oscillations

[Based on this report.]

The 2000 Science article “Crossing the Hopf Bifurcation in a Live Predator-Prey System” by Fussmann et. al. measured predator-prey population dynamics modulated by a third parameter. It recorded population fluctuations of planktonic rotifers (Brachionus calyciflorus, the predators) and green algae (Chlorella vulgaris, the prey) over runs of 50-250 days in carefully stabilized chemostats. The nutrient nitrogen was also added to the systems and held at a constant. The dilution rate delta (the fraction of the system’s volume that was replaced daily) was varied in each of the trial runs.

The study found that for extreme values of delta the populations were static and stable – either dying off or converging to constants – but in an intermediate range the populations exhibited the delayed oscillations that are classic dynamics in predator-prey systems.

All the data at once

Slice by delta, subset days

After hunting for smaller intervals of days, the dataset is now 220 by 3.


C. Clustering of Indian households

[Based on this report.]

I’ve been playing around with questionnaire data from India’s National Family Health Survey 3, conducted 2005-2006. One thing I looked at was how to cluster the households based on their characteristics – is there electricity, a TV, a computer, chickens, ox-drawn carts, glass windows, etc?

Here are some persistence diagrams. These came from a 993 x 44 dataset and took my laptop 7.5 minutes to generate.